Notebook used for submission of peer-graded assignment
First full ML project in python
The purpose of this exploratory data analysis exercise is to assess how accurately car accident severity in Seattle, USA can be predicted by means of supervised machine learning models, exploiting collision records from past accidents that were recorded by the Seattle Police Department (SPD) and provided as open data by Traffic Records.
Being able to predict car accident severity from external factors like weather, location, and road conditions, as well as speeding, influence of alcohol/drugs etc., will allow the government to put appropriate measures in place to reduce accident severity, but above all allow the police and first-response teams to channel their resources and increase efficiency.
Using car accident records from March 2013, three different machine-learning methods, namely K-Nearest Neighbours (KNN), Decision Trees, and Logistic Regression, were benchmarked against each other.
While the exploratory data analysis suggests that almost 90% of all accidents involving pedestrians and 88% of collisions involving cyclists lead to injuries (compared with 28% of accidents without pedestrians/cyclists), the tested machine learning models generally struggled to correctly predict SEVERITYCODE=2 and therefore exhibited a high number of false negatives for this class, which in real life can lead to a wrong allocation of police and first-responder resources and potentially end deadly.
The study was carried out in October 2020.
Despite the fact that the US population has increased threefold since the beginning of the 20th century, with the total number of cars passing the 280 million mark in 2019 (source) and leading to a whopping 3,000 billion miles travelled p.a., traffic fatality rates continuously decline (Fig. 1). This decrease of deaths in car accidents is related to measures enforced by law (e.g. the seat belt law, 1968), advanced safety features (mandatory air bags, 1998), but also improved road safety (e.g. signs, traffic lights etc.).
Moreover, thanks to technical advances in computer technology in recent decades, efficient measures can be taken even after an accident has happened, e.g. by minimizing the response time of emergency teams and police through the smart analysis of accident records.
This coding exercise demonstrates how collision records can be analysed to provide insight into predicting car accident severity, using the example of Seattle, USA.
Fig. 1: Development of annual car accident fatalities in the US.
The graph shows annual US miles traveled (blue), traffic fatalities per billion vehicle miles traveled (red), per million people (orange), total annual deaths (light blue), vehicle miles travelled (VMT) in 10s of billions (dark blue) and US population in millions (turquoise) (source).
The data used in this project was recorded by the Seattle Police Department (SPD) and provided as open data by Traffic Records. The spreadsheet can be downloaded free of charge from the Seattle GeoData Portal; records have been acquired since 2004.
The dataset used in this study comprises 194673 accidents, recorded and documented in Seattle in March 2013.
Apart from the target variable, the severity of the collision SEVERITYCODE, meta information from 37 additional features is available. These comprise information about the location of the accident (e.g. X, Y, LOCATION, JUNCTIONTYPE, CROSSWALKKEY, SEGLANEKEY, HITPARKEDCAR, ADDRTYPE), external factors (e.g. LIGHTCOND, WEATHER, ROADCOND) or lawlessness (e.g. INATTENTIONIND, UNDERINFL, PEDROWNOTGRN, SPEEDING).
The severity of the collision is categorised as property damage only (SEVERITYCODE=1) or injury collision (SEVERITYCODE=2).
For additional information about the individual attributes, kindly refer to the corresponding official documentation, which can be found here.
A first overview of the spatial distribution of car accidents in 2013 can be found below. Note the increased number of accidents at intersections (zoom in).
%matplotlib inline
# import libraries
import os.path
from IPython.display import clear_output
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium import plugins
import imblearn
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_curve, auc
# Load the data and save data to pandas data frame
data_url = 'https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv'
df = pd.read_csv(data_url)
# Plot heatmap from accidents
loc = 'Density Map of Traffic Accidents, Seattle (March 2013)'
title_html = '''
<h3 align="center" style="font-size:15px"><b>{}</b></h3>
'''.format(loc)
locations = df[['Y', 'X']][df['Y'].notna()]
heatmap = folium.Map(location=[47.605, -122.3], zoom_start=11, tiles='CartoDB dark_matter')
# HeatMap colouring is controlled via its gradient parameter; it does not accept a matplotlib cmap
heatmap.add_child(folium.plugins.HeatMap(locations[['Y', 'X']].values, radius=8))
heatmap.get_root().html.add_child(folium.Element(title_html))
heatmap
The data analysis workflow is designed as follows:
1. Data Loading:
2. Data Overview:
3. Data Clean-up:
4. Exploratory Data Analysis:
5. Model Building:
(6. Model Deployment and Evaluation)
Note: Data has already been loaded to display the heat map here.
The libraries used can be found above.
Observations are documented in the code or below in blue boxes.
Key variables:
df: Data frame with raw data
df_c_fin: Data frame with cleaned-up data
*_INT: Factorised object-type feature in dataframe
In order to simplify parts of the data analysis, various functions were written. This was mainly done for practicing.
# Count number of NaNs in dataframe
def df_qc(df):
    '''Return a data frame with valid and NaN counts per column, sorted by NaN count.'''
    missing_data = df.isnull()
    df_nan = pd.DataFrame(columns=['ID', 'val', 'nan'])
    for column in missing_data.columns.values.tolist():
        counts = missing_data[column].value_counts()
        val_ok = counts.get(False, 0)   # entries with a value
        val_nan = counts.get(True, 0)   # missing entries
        df_new_row = pd.DataFrame(data=np.array([[column, val_ok, val_nan]]), columns=['ID', 'val', 'nan'])
        df_nan = pd.concat([df_nan, df_new_row], ignore_index=True)
    # cast values as integer
    df_nan = df_nan.astype({'val': 'int', 'nan': 'int'})
    df_nan.set_index('ID', inplace=True)
    df_nan.sort_values(['nan'], ascending=True, axis=0, inplace=True)
    return df_nan
# Function to plot histograms from features
def plt_hist_col(data_frame, column_name):
    '''
    Function to plot distribution of unique value counts of a SINGLE COLUMN in a data frame.
    INPUT:
    data_frame: Pandas data frame
    column_name: exact column name (STR)
    '''
    tot_len = len(data_frame[column_name])
    data_frame_perc = round((pd.Series(data_frame[column_name]).value_counts() / tot_len * 100), 3)
    td = data_frame_perc.to_frame()
    ax1 = td.plot(kind="bar", figsize=(14, 3), rot=90, width=0.4)
    ax1.set_yticks(np.arange(0, 110, 10))
    ax1.set_title('Relative distribution of value counts for ' + str(column_name) + ' (' + str(tot_len) + ' total)', fontsize=12)
    for p in ax1.patches:
        ax1.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() + 2), fontsize=10, color="r")
    plt.show()
    return ax1
# Interactively QC all column data types and output list with features that shall be removed
def df_type_qc(data_frame_r, verbose=True):
    '''
    Function to interactively QC all column data types and cast them into other types if required.
    Returns a list del_list with column names to be deleted in a next step.
    INPUT:
    data_frame_r: Pandas data frame
    verbose=True: Output of statistics True/False (BOOLEAN)
    OUTPUT:
    data_frame: QCed data frame
    del_list: List with column names to be deleted
    '''
    data_frame = data_frame_r.copy()     # copy input df
    hist_max = 50                        # maximum number of unique values to plot histogram
    del_list = []                        # list with columns to be deleted
    i = 0
    cast = ""
    len_df = len(data_frame.columns)     # number of columns
    missing_data = data_frame.isnull()   # flag missing data in a True/False df
    # iterate through all columns and check the data type;
    # user input decides whether to cast into a different data type
    while i < len_df and cast != "x":    # total number of columns
        col_name = str(data_frame.columns[i])  # column name
        # calculate missing data per column
        counts = missing_data[col_name].value_counts()
        val_ok = counts.get(False, 0)
        val_nan = counts.get(True, 0)
        tot_count = len(data_frame[col_name])
        # if verbose=True output
        if verbose:
            print(f"######################################")
            print(f"{i+1}/{len_df}: {col_name}")
            print(f"\ttotal:\t{tot_count}")
            print(f"\tOK:\t{val_ok}\t({round(val_ok/tot_count*100, 2)}%)")
            print(f"\tNaN:\t{val_nan}\t({round(val_nan/tot_count*100, 2)}%)")
            print("----------HEAD----------")
            print(f"{data_frame[col_name].head()}\n")
            if len(data_frame[col_name].value_counts()) <= hist_max:
                plt_hist_col(data_frame, col_name)
        print(f"x: quit\ni: cast to INT\nf: cast to FLT\no: cast to OBJ\nd: mark to DELETE (add to del_list)\nelse: skip")
        cast = input(">>")
        if cast == "i":
            data_frame[[col_name]] = data_frame[[col_name]].astype("int64", errors='ignore')
        elif cast == "f":
            data_frame[[col_name]] = data_frame[[col_name]].astype("float64", errors='ignore')
        elif cast == "o":
            data_frame[[col_name]] = data_frame[[col_name]].astype("object", errors='ignore')
        elif cast == "d":
            del_list.append(col_name)
        i += 1
    return (data_frame, del_list)  # return QCed df and list with column names to be deleted
# all raw data stored in dataframe df
print(f"Dimensions: {df.shape}")
df.info()
The input data contains 194673 entries and is split into 38 features of which SEVERITYCODE will be the target value to predict.
# Distribution of the target value SEVERITYCODE
feature="SEVERITYCODE"
df[feature].value_counts().to_frame()
# NaN values per feature
df_qc(df)
The table above lists the number of NaN values per feature. Whether this means that the respective feature is False or indicates a missing value needs to be evaluated feature by feature.
# Extract Year and Month and create new column
df_c = df.copy()
df_c["YEAR"] = int(df["INCDATE"][0][0:4])
df_c["MONTH"] = int(df["INCDATE"][0][5:7])
df_c.drop(['INCDATE'], axis=1, inplace=True)
# Run interactive function to reformat columns
df_c_qc,del_list = df_type_qc(df_c)
The interactive script allows to efficiently go through the individual features, count the NaN values, report the data type, and show the distribution of unique values (if below 50) in a histogram plot. User input is required to decide whether the data type of the respective column shall be cast into another format or whether the whole column shall be flagged for deletion in a subsequent step. This is done in case the entry is irrelevant for the model building (e.g. YEAR/MONTH, since the data comprises only one month, or ST_COLCODE/ST_COLDESC, which describe the accident) or if it contains too many NaN values.
# list with the flagged features that will be deleted
del_list = ["COLDETKEY", "REPORTNO", "STATUS", "INTKEY", "EXCEPTRSNCODE", "EXCEPTRSNDESC",
"SEVERITYCODE.1","SEVERITYDESC","INCDTTM","PEDROWNOTGRNT","SDOTCOLNUM",
"ST_COLCODE","ST_COLDESC","YEAR","MONTH"]
# delete columns based on del_list entries
df_c_qc = df_c.copy()
df_c_qc.drop(del_list, axis=1, inplace=True)
# reformat and encode individual column entries into integers for
# simple math. Hot-encoding is done in a subsequent step
feature="ADDRTYPE"
df_c_qc = df_c_qc[df_c_qc[feature].notna()]
feature="INATTENTIONIND"
df_c_qc[feature] = df_c_qc[feature].replace(np.nan, "0")
df_c_qc[feature] = df_c_qc[feature].replace("Y", "1")
df_c_qc[feature] = df_c_qc[feature].astype("int64", errors='ignore')
feature="HITPARKEDCAR"
df_c_qc[feature] = df_c_qc[feature].replace(np.nan, "0")
df_c_qc[feature] = df_c_qc[feature].replace("Y", "1")
df_c_qc[feature] = df_c_qc[feature].replace("N", "0")
df_c_qc[feature] = df_c_qc[feature].astype("int64")
feature="UNDERINFL"
df_c_qc[feature] = df_c_qc[feature].replace(np.nan, "0")
df_c_qc[feature] = df_c_qc[feature].replace("N", "0")
df_c_qc[feature] = df_c_qc[feature].replace("Y", "1")
df_c_qc[feature] = df_c_qc[feature].astype("int64", errors='ignore')
feature="SPEEDING"
df_c_qc[feature] = df_c_qc[feature].replace("Y", "1")
df_c_qc[feature] = df_c_qc[feature].replace(np.nan, "0")
df_c_qc[feature] = df_c_qc[feature].astype("int64", errors='ignore')
feature="WEATHER"
df_c_qc[feature] = df_c_qc[feature].replace("Other", "Unknown")
df_c_qc[feature] = df_c_qc[feature].replace(np.nan, "Unknown")
df_c_qc[(feature)+'_INT'] = df_c_qc[feature].replace({'Clear': '1',
'Partly Cloudy': '2',
'Overcast': '3',
'Severe Crosswind': '4',
'Raining':'6',
'Blowing Sand/Dirt':'7',
'Sleet/Hail/Freezing Rain': '8',
'Fog/Smog/Smoke':'9',
'Snowing':'10',
'Unknown':'5',
})
df_c_qc[(feature)+'_INT'] = df_c_qc[(feature)+'_INT'].astype("int64", errors='ignore')
feature="ROADCOND"
df_c_qc[feature] = df_c_qc[feature].replace("Other", "Unknown")
df_c_qc[feature] = df_c_qc[feature].replace(np.nan, "Unknown")
df_c_qc[(feature)+'_INT'] = df_c_qc[feature].replace({'Dry': '1',
'Sand/Mud/Dirt': '2',
'Wet': '3',
'Unknown': '4',
'Standing Water':'5',
'Snow/Slush':'6',
'Ice': '7',
'Oil':'8',
})
df_c_qc[(feature)+'_INT'] = df_c_qc[(feature)+'_INT'].astype("int64", errors='ignore')
feature="LIGHTCOND"
df_c_qc[feature] = df_c_qc[feature].replace("Other", "Unknown")
df_c_qc[feature] = df_c_qc[feature].replace(np.nan, "Unknown")
df_c_qc[feature] = df_c_qc[feature].replace("Dark - No Street Lights", "Dark - Street Lights Off")
df_c_qc[(feature)+'_INT'] = df_c_qc[feature].replace({'Daylight': '1',
'Dusk': '2',
'Dawn': '3',
'Dark - Street Lights On': '4',
'Unknown': '5' ,
'Dark - Unknown Lighting':'6',
'Dark - Street Lights Off':'7',
})
df_c_qc[(feature)+'_INT'] = df_c_qc[(feature)+'_INT'].astype("int64")
feature="LOCATION"
df_c_qc[(feature)+'_INT'] = pd.factorize(df_c_qc[feature])[0] + 1
feature="JUNCTIONTYPE"
df_c_qc[feature] = df_c_qc[feature].replace(np.nan, "Unknown")
df_c_qc[(feature)+'_INT'] = df_c_qc[feature].replace({'Mid-Block (not related to intersection)': '1',
'At Intersection (intersection related)': '2',
'Mid-Block (but intersection related)': '3',
'Unknown': '4',
'At Intersection (but not related to intersection)': '5' ,
'Ramp Junction':'6',
'Driveway Junction':'7',
})
df_c_qc[(feature)+'_INT'] = df_c_qc[(feature)+'_INT'].astype("int64")
feature="ADDRTYPE"
#df_c_qc[(feature)+'_INT'] = pd.factorize(df_c_qc[feature])[0] + 1
df_c_qc[(feature)+'_INT'] = df_c_qc[feature].replace({'Block': '1',
'Alley': '2',
'Intersection': '3',
})
df_c_qc[(feature)+'_INT'] = df_c_qc[(feature)+'_INT'].astype("int64")
#plt_hist_col(df_c_qc,'ADDRTYPE_INT')
#df_c_qc.info()
#df_c_qc['ADDRTYPE'].head()
#df['ADDRTYPE_INT'].value_counts().to_frame()
plt_hist_col(df_c_qc,'WEATHER')
In this subsection, data correlations and feature relevance are being examined.
# Correlation matrix of cleaned-up dataframe
corr = df_c_qc.corr()
plt.figure(figsize = (16,6))
corr_plot = sns.heatmap(corr,xticklabels=corr.columns,yticklabels=corr.columns,cmap='RdBu_r', vmin=-1, vmax=1, annot=True)
corr_plot.set_title('Correlation Matrix', fontsize=12)
corr_plot
A first look into the correlation matrix of the features from the cleaned-up dataframe reveals:
- X/Y have a low correlation value with the other data and can therefore be discarded.
- There is a strong correlation between features involving pedestrians (PEDCOUNT/CROSSWALKKEY/SEGLANEKEY) or cyclists (PEDCYLCOUNT) and the severity SEVERITYCODE.
- WEATHER_INT and ROADCOND_INT are correlated.
- UNDERINFL and LIGHTCOND_INT are correlated, suggesting drunk driving occurs more often at night.

Moreover, features like OBJECTID, COLLISIONTYPE, SDOT_COLCODE and SDOT_COLDESC are assigned by the police after the accident and will therefore be dropped.
# clean-up final dataframe
df_c_fin = df_c_qc.drop(['X','Y','OBJECTID',
'SDOT_COLCODE', 'SDOT_COLDESC', 'ADDRTYPE', 'LOCATION',
'COLLISIONTYPE'], axis=1)
feature="SEVERITYCODE"
s2_values = round(df_c_fin[feature].value_counts()[2] / (df_c_fin[feature].value_counts()[1]+df_c_fin[feature].value_counts()[2])*100,2)
print(f"Relative amount of type 2 (injuries) accidents of total dataset: {s2_values}%.")
Analysis of the relative distribution of accidents leading to property damage only (SEVERITYCODE=1) and accidents resulting in injuries (SEVERITYCODE=2) shows that, at 30%, the latter is underrepresented in the unbalanced input dataset. Consequently, all resulting models will be biased towards the former outcome of a collision.
An attempt to overcome this issue is made through Synthetic Minority Oversampling below.
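SMOTE itself is applied in the Model Building section below; as a minimal, illustrative sketch of the rebalancing idea, plain random oversampling with scikit-learn's `resample` is shown here on toy data (SMOTE additionally interpolates synthetic minority samples rather than duplicating existing ones):

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame mimicking the imbalance: class 1 dominates class 2
toy = pd.DataFrame({"SEVERITYCODE": [1] * 70 + [2] * 30, "FEAT": range(100)})

majority = toy[toy["SEVERITYCODE"] == 1]
minority = toy[toy["SEVERITYCODE"] == 2]

# Randomly oversample the minority class up to the majority class size
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=1)
toy_balanced = pd.concat([majority, minority_up])
```

After resampling, both classes contribute equally many rows, so a classifier trained on the balanced frame is no longer biased towards class 1.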
# Impact of features on severitycode #1: traffic offenses
target = "SEVERITYCODE"
features_law = ["UNDERINFL", "SPEEDING", "INATTENTIONIND"]
fig = plt.figure(figsize=(20, 8)) # create figure
fig.suptitle("Severitycode vs Traffic Offenses", fontsize='20')
ax1 = fig.add_subplot(131)
ax2 = fig.add_subplot(132)
ax3 = fig.add_subplot(133)
#df_c_fin.plot(kind='bar', ax=ax1)
sns.catplot(x=features_law[0], kind='count', hue=target, data=df_c_fin, ax=ax1)
sns.catplot(x=features_law[1], kind='count', hue=target, data=df_c_fin, ax=ax2)
sns.catplot(x=features_law[2], kind='count', hue=target, data=df_c_fin, ax=ax3)
ax1.legend(loc='upper right',title=target)
ax2.legend(loc='upper right',title=target)
ax3.legend(loc='upper right',title=target)
plt.close(2)
plt.close(3)
plt.close(4)
for i, f in enumerate(features_law):
    sev1Y = df_c_fin[(df_c_fin["SEVERITYCODE"] == 1) & (df_c_fin[f] == 1)].shape[0]
    sev1N = df_c_fin[(df_c_fin["SEVERITYCODE"] == 1) & (df_c_fin[f] == 0)].shape[0]
    sev2Y = df_c_fin[(df_c_fin["SEVERITYCODE"] == 2) & (df_c_fin[f] == 1)].shape[0]
    sev2N = df_c_fin[(df_c_fin["SEVERITYCODE"] == 2) & (df_c_fin[f] == 0)].shape[0]
    print(f"{f}:\n {round(sev2Y / (sev1Y+sev2Y) * 100,2)}% of accidents with {f}=1 have {target}=2.")
    print(f" {round(sev2N / (sev1N+sev2N) * 100,2)}% of accidents with {f}=0 have {target}=2.")
Analysis of UNDERINFL, SPEEDING, and INATTENTIONIND vs SEVERITYCODE reveals that traffic offenses increase the risk of accidents resulting in injuries (SEVERITYCODE 2) by up to 10%.
# Impact of features on severitycode #2: pedestrians / cyclists
target = "SEVERITYCODE"
features_ped = ["PEDCOUNT", "PEDCYLCOUNT"]
fig = plt.figure(figsize=(20, 8)) # create figure
fig.suptitle("Severitycode vs Pedestrians / Cyclists", fontsize='20')
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
sns.catplot(x=features_ped[0], kind='count', hue=target, data=df_c_fin, ax=ax1)
sns.catplot(x=features_ped[1], kind='count', hue=target, data=df_c_fin, ax=ax2)
ax1.legend(loc='upper right',title=target)
ax2.legend(loc='upper right',title=target)
plt.close(2)
plt.close(3)
for i, f in enumerate(features_ped):
    sev1N = df_c_fin[(df_c_fin["SEVERITYCODE"] == 1) & (df_c_fin[f] == 0)].shape[0]
    sev1Y = df_c_fin[(df_c_fin["SEVERITYCODE"] == 1) & (df_c_fin[f] == 1)].shape[0]
    sev2N = df_c_fin[(df_c_fin["SEVERITYCODE"] == 2) & (df_c_fin[f] == 0)].shape[0]
    sev2Y = df_c_fin[(df_c_fin["SEVERITYCODE"] == 2) & (df_c_fin[f] == 1)].shape[0]
    print(f"{f}:\n {round(sev2Y / (sev1Y+sev2Y) * 100,2)}% of accidents with {f}>=1 have {target}=2.")
    print(f" {round(sev2N / (sev1N+sev2N) * 100,2)}% of accidents with {f}=0 have {target}=2.")
Analysis of PEDCOUNT and PEDCYLCOUNT vs SEVERITYCODE reveals that accidents involving pedestrians or cyclists increase the risk of injuries (SEVERITYCODE 2) by up to 70%.
# Impact of features on severitycode #3: weather, daytime, road condition
target = "SEVERITYCODE"
features_ext = ["WEATHER", "ROADCOND", "LIGHTCOND"]
fig = plt.figure(figsize=(20, 8))
ax1 = sns.catplot(x=features_ext[0], kind='count', hue=target, data=df_c_fin, height=4, aspect=5)
ax1.fig.suptitle(('Impact of '+features_ext[0]+' on '+target),size=20)
ax2 = sns.catplot(x=features_ext[1], kind='count', hue=target, data=df_c_fin, height=4, aspect=5)
ax2.fig.suptitle(('Impact of '+features_ext[1]+' on '+target),size=20)
ax3 = sns.catplot(x=features_ext[2], kind='count', hue=target, data=df_c_fin, height=4, aspect=5)
ax3.fig.suptitle(('Impact of '+features_ext[2]+' on '+target),size=20)
plt.show()
# Playing around with dictionary for practice here
#initialise dictionary to count distibution of feature values per SEVERITYCODE
sev_dict = {1:[0,0],2:[0,0],3:[0,0],4:[0,0],5:[0,0],6:[0,0],7:[0,0],8:[0,0],9:[0,0],10:[0,0]}
features_ext_int = ["WEATHER_INT", "ROADCOND_INT", "LIGHTCOND_INT"]
for feature_name in features_ext_int:
    print(feature_name)
    # outer loop over severity code (1/2)
    for sevcode in range(1, 3, 1):
        feature_length = df_c_fin[feature_name].nunique()  # count number of unique feature values
        for feature_code in range(1, feature_length+1, 1):
            sev_dict[feature_code][sevcode-1] = df_c_fin[(df_c_fin["SEVERITYCODE"] == sevcode) & (df_c_fin[feature_name] == feature_code)].shape[0]
            total = sev_dict[feature_code][0] + sev_dict[feature_code][1]
            if sevcode == 2:
                print(f" {round(sev_dict[feature_code][1] / total * 100,2)}% of accidents with feature code = {feature_code} have {target}=2. ({total} counts)")
Analysis of external features like WEATHER, ROADCOND, or LIGHTCOND reveals that:

- The WEATHER category with the highest share of injuries is statistically not stable, as only 5 accidents occurred under this condition. Other weather conditions resulting in injuries are raining (33.78%), fog/smog/smoke (33.04%) and clear (32.32%). While it is logical that reduced visibility and aquaplaning may increase accident numbers, the high injury rate during clear weather seems counterintuitive at first glance. However, considering that the involvement of pedestrians/cyclists in accidents increases the risk of injury by around 70%, and assuming that they are more likely to be on the streets when the weather is good, this is less surprising.
- The road condition (ROADCOND) with the highest injury rate (37.5%) rests on only 64 counts and is therefore statistically not stable either. Other road conditions leading to accidents with injuries (wet, 33.23%, and dry, 32.24%) correlate with the weather conditions.
- The LIGHTCOND feature shows that around one third of all accidents happening during daylight, dusk, or dawn lead to injuries. Injuries are less likely during night time due to the reduced number of pedestrians/cyclists on the street. The suspiciously low value of 9.55% corresponds to the "Unknown" category.
# Impact of features on severitycode #4: location
#initialise dictionary to count distibution of feature values per SEVERITYCODE
sev_dict = {1:[0,0],2:[0,0],3:[0,0],4:[0,0],5:[0,0],6:[0,0],7:[0,0],8:[0,0],9:[0,0],10:[0,0]}
features_loc = ["JUNCTIONTYPE_INT", "ADDRTYPE_INT"]
for feature_name in features_loc:
    print(feature_name)
    # outer loop over severity code (1/2)
    for sevcode in range(1, 3, 1):
        feature_length = df_c_fin[feature_name].nunique()  # count number of unique feature values
        for feature_code in range(1, feature_length+1, 1):
            sev_dict[feature_code][sevcode-1] = df_c_fin[(df_c_fin["SEVERITYCODE"] == sevcode) & (df_c_fin[feature_name] == feature_code)].shape[0]
            total = sev_dict[feature_code][0] + sev_dict[feature_code][1]
            if sevcode == 2:
                print(f" {round(sev_dict[feature_code][1] / total * 100,2)}% of accidents with feature code = {feature_code} have {target}=2. ({total} counts)")
Analysis of data features JUNCTIONTYPE_INT and ADDRTYPE_INT, describing the location of the collision shows that most accidents leading to injuries are intersection related (43.27% and 42.75%, respectively). The features contain similar information. Due to the higher granularity, JUNCTIONTYPE_INT will be kept.
This subsection is split into two parts: 1) augmenting and encoding data and 2) model testing.
One-Hot Encoding is a data-processing step applied to categorical data to convert it into a binary vector representation for use in machine learning algorithms.
It is required because many ML techniques would otherwise assume a natural ordering between label-encoded integers, which results in poor model performance.
For the modelling, the following features will be one-hot encoded: JUNCTIONTYPE, LIGHTCOND, WEATHER, ROADCOND.
More information can be found here.
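As a small, self-contained illustration of what `pd.get_dummies` does to one of these columns (toy values, not the full dataset):

```python
import pandas as pd

# Toy ROADCOND column with three categories
toy = pd.Series(["Dry", "Wet", "Ice", "Dry"], name="ROADCOND")

# One-hot encode: every category becomes its own 0/1 column,
# so no artificial ordering between categories is implied
dummies = pd.get_dummies(toy, prefix="ROADCOND")
```

Each row of `dummies` has exactly one column set to 1, marking the category of that observation.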
# One-Hot encode categorical data of cleaned-up dataframe df_c_fin
df_weather = pd.get_dummies(df_c_fin.WEATHER, prefix="WEATHER")
df_roadcond = pd.get_dummies(df_c_fin.ROADCOND, prefix="ROADCOND")
df_junction = pd.get_dummies(df_c_fin.JUNCTIONTYPE, prefix="JUNCTIONTYPE")
df_light = pd.get_dummies(df_c_fin.LIGHTCOND, prefix="LIGHTCOND")
df_1hot = pd.concat([df_light, df_junction, df_roadcond, df_weather], axis=1)
#create list
features_1hot = list(df_1hot.columns)
# Merge dataframes and define feature list
df_fin = pd.concat([df_c_fin, df_1hot], axis=1)
# create list with encoded features to be used for model building
target = "SEVERITYCODE"
features_loc = ["JUNCTIONTYPE_INT"]
features_ext = ["LIGHTCOND_INT", "WEATHER_INT", "ROADCOND_INT"]
features_ped = ["PEDCOUNT", "PEDCYLCOUNT", "VEHCOUNT"]
features_law = ["SPEEDING", "UNDERINFL", "INATTENTIONIND"]
features_final = features_law + features_1hot
# Split data set into training and testing data
split_ratio = 0.4
X = df_fin[features_final]
Y = df_fin[target]
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=split_ratio,random_state=0)
print(f"----TRAINING / TESTING DATA CREATED----")
print(f"Training Data: {x_train.shape},{y_train.shape}")
print(f"Testing Data: {x_test.shape} x {y_test.shape}")
print(f"Ratio: {split_ratio}")
# 'sm' is used as the name here because 'os' would shadow the os module imported above
sm = SMOTE(random_state=1)
x_train_SMOTE, y_train_SMOTE = sm.fit_resample(x_train, y_train)  # fit_resample replaces the deprecated fit_sample
print(f"----TRAINING DATA AUGMENTED for {target}----")
print(f"Relative Distribution for {target}\nInput Training Data x_train / y_train")
print(y_train.value_counts()/len(y_train))
print('')
print(f"Relative Distribution for {target}\nAugmented Training Data x_train_SMOTE / y_train_SMOTE")
print(pd.Series(y_train_SMOTE).value_counts()/len(y_train_SMOTE))
KNN is a non-parametric, supervised machine learning algorithm used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output of the model is a class membership.
An object is classified by a plurality vote of its neighbours, being assigned to the class most common among its k nearest neighbours (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its single nearest neighbour. For k > 1, the label that is most frequent among the k training samples nearest to the query point is assigned to the query/test point.
Since this algorithm relies on distance for classification, normalisation of input data is important (reference).
Note that KNNs have a high prediction cost for high-dimensional data, due to calculating the Euclidean distance in the multi-dimensional feature space.
Fig. 2: Example of k-NN classification. The test sample (green dot) should be classified either to blue squares or to red triangles. If k = 3 (solid line circle) it is assigned to the red triangles because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (dashed line circle) it is assigned to the blue squares (3 squares vs. 2 triangles inside the outer circle). (source).
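The distance-based classification described above can be sketched on synthetic stand-in data (not the collision features), with `StandardScaler` handling the normalisation that KNN needs before the distance computation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class problem standing in for the encoded collision features
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Scale features first, then vote among the k=5 nearest neighbours
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(Xtr, ytr)
acc = knn.score(Xte, yte)
```

Wrapping scaler and classifier in one pipeline ensures that the test data is scaled with the statistics of the training data, avoiding leakage.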
# KNN classifier for unbalanced, encoded input dataset x_train, y_train, x_test, y_test
Ks = 8 # number of iterations
conf_mat = []
report = []
Yhat = []
f1_weight_avg = []
for n in range(1, Ks+1):
    classifier = KNeighborsClassifier(n_neighbors=n).fit(x_train, y_train)
    Yhat.append(classifier.predict(x_test))
    conf_mat.append(confusion_matrix(y_test, Yhat[n-1]))
    report.append(classification_report(y_test, Yhat[n-1]))
    f1_weight_avg.append(f1_score(y_test, Yhat[n-1], average='weighted'))
# get results for maximum weighted F1 score:
Ks_best = np.argmax(f1_weight_avg)+1
conf_mat_best = conf_mat[Ks_best-1]
report_best = report[Ks_best-1]
Yhat_best = Yhat[Ks_best-1]
print(report_best)
fig = plt.figure(figsize=(20, 8)) # create figure
fig.suptitle("KNN Model QC", fontsize='20')
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
#Qualitative assessment of the quality of the model by comparing the value distribution
# of the real data with the predicted values.
ax1 = sns.distplot(Y, hist=True, color="r", label="Actual Value", ax=ax1)
ax1 = sns.distplot(Yhat_best, hist=True, color="b", label="Fitted Values", ax=ax1)
ax1.set_ylabel("Observed data")
ax1.set_xlabel("SEVERITYCODE")
ax1.set_title('Distribution plot | KNN: Unbalanced data K='+str(Ks_best))
ax1.legend()
# Weighted F1 score development with K
ax2 = plt.plot(range(1,Ks+1),f1_weight_avg)
plt.title('Weighted F1 score vs K value | KNN: Unbalanced data K='+str(Ks_best))
plt.ylabel("F1 weighted avg")
plt.xlabel("K")
plt.show()
The classification report shows the main classification metrics precision, recall and f1-score on a per-class basis. The metrics are calculated by using true and false positives, true and false negatives. Positive and negative in this case are generic names for the predicted classes:
1) Precision: Accuracy of positive predictions.
Precision = True Positives/(True Positives + False Positives)
2) Recall: Ability of a classifier to find all positive instances. For each class it is defined as the ratio of true positives to the sum of true positives and false negatives (Fraction of positives that were correctly identified).
Recall = True Positives / (True Positives + False Negatives)
3) F1 score: This is a weighted harmonic mean of precision and recall such that the best score is 1.0 and the worst is 0.0.
F1 Score = 2*(Recall * Precision) / (Recall + Precision)
4) Support is the number of occurrences of each class in the correct target values.
5) Accuracy: Fraction of predictions the model got right. Accuracy = Number of correct predictions / Total number of predictions
6) Macro F1 calculates the F1 score per class but aggregates without weights, which results in a bigger penalisation when the model does not perform well on the minority classes.
7) Weighted F1 score calculates the F1 score for each class independently. When it adds F1 scores per class together, it uses a weight that depends on the number of true labels of each class.
More info can be found here.
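The metric definitions above can be checked on a small, hypothetical pair of label vectors (1 = property damage, 2 = injury):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical true and predicted severity classes
y_true = [1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [1, 1, 1, 2, 2, 2, 1, 1]

# For class 2: TP=2, FP=1, FN=2
prec2 = precision_score(y_true, y_pred, pos_label=2)  # 2 / (2 + 1)
rec2 = recall_score(y_true, y_pred, pos_label=2)      # 2 / (2 + 2)
f1_2 = f1_score(y_true, y_pred, pos_label=2)          # harmonic mean of the two
```

This mirrors how the classification report computes its per-class rows before aggregating them into macro and weighted averages.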
# KNN classifier for balanced, encoded input dataset x_train_SMOTE, y_train_SMOTE, x_test, y_test
Ks = 8 # number of iterations
conf_mat_SMOTE = []
report_SMOTE = []
Yhat_SMOTE = []
f1_weight_avg_SMOTE = []
for n in range(1, Ks+1):
    classifier_SMOTE = KNeighborsClassifier(n_neighbors=n).fit(x_train_SMOTE, y_train_SMOTE)
    Yhat_SMOTE.append(classifier_SMOTE.predict(x_test))
    conf_mat_SMOTE.append(confusion_matrix(y_test, Yhat_SMOTE[n-1]))
    report_SMOTE.append(classification_report(y_test, Yhat_SMOTE[n-1]))
    f1_weight_avg_SMOTE.append(f1_score(y_test, Yhat_SMOTE[n-1], average='weighted'))
# get results for maximum weighted F1 score:
Ks_best_SMOTE = np.argmax(f1_weight_avg_SMOTE)+1
conf_mat_best_SMOTE = conf_mat_SMOTE[Ks_best_SMOTE-1]
report_best_SMOTE = report_SMOTE[Ks_best_SMOTE-1]
Yhat_best_SMOTE = Yhat_SMOTE[Ks_best_SMOTE-1]
print(report_best_SMOTE)
fig = plt.figure(figsize=(20, 8)) # create figure
fig.suptitle("KNN Model QC SMOTE", fontsize='20')
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
#Qualitative assessment of the quality of the model by comparing the value distribution
# of the real data with the predicted values.
ax1 = sns.distplot(Y, hist=True, color="r", label="Actual Value", ax=ax1)
ax1 = sns.distplot(Yhat_best_SMOTE, hist=True, color="b", label="Fitted Values", ax=ax1)
ax1.set_ylabel("Observed data")
ax1.set_xlabel("SEVERITYCODE")
ax1.set_title('Distribution plot | KNN: SMOTE-Balanced data K='+str(Ks_best_SMOTE))
ax1.legend()
# Weighted F1 score development with K
ax2 = plt.plot(range(1,Ks+1),f1_weight_avg_SMOTE)
plt.title('Weighted F1 score vs K value | KNN: SMOTE-Balanced data K='+str(Ks_best_SMOTE))
plt.ylabel("F1 weighted avg")
plt.xlabel("K")
plt.show()
The high precision and recall values for class 1 suggest that the model is good at predicting SEVERITYCODE=1 accidents. The low values for SEVERITYCODE=2, however, suggest that most collisions resulting in injuries will be predicted incorrectly.
A decision tree is a flowchart-like tree structure in which each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. The topmost node in a decision tree is known as the root node. The tree learns to partition the data on the basis of attribute values, splitting recursively in a process called recursive partitioning. This flowchart-like structure supports decision making.
Its visualization resembles a flowchart diagram and closely mimics human-level reasoning, which is why decision trees are easy to understand and interpret. More information can be found here.
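The splitting criterion used in the model below (criterion="entropy") picks, at each recursive partition, the split with the largest reduction in entropy. As a minimal illustration (plain Python, hypothetical toy labels), the following sketch computes the entropy of a label set and the information gain of one candidate binary split:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into `left` + `right`."""
    n = len(parent)
    child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - child

# Hypothetical severity labels before and after one candidate split
parent = [1, 1, 1, 1, 2, 2, 2, 2]
left, right = [1, 1, 1, 2], [1, 2, 2, 2]

print(entropy(parent))                      # 1.0 (perfectly mixed classes)
print(information_gain(parent, left, right))  # ~0.189 bits gained
```

The tree grows by repeating this comparison over all candidate feature splits and recursing into the resulting partitions until a stopping rule (e.g. max_depth) is hit.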
# AUC is a good evaluation metric for this type of problem.
max_depths = np.linspace(3, 21, 19, endpoint=True)
report_tree = []
f1_weight_avg_tree = []
train_results = []
test_results = []
Yhat_tree_list = []
for n in max_depths:
    n = int(n)
    clf = DecisionTreeClassifier(max_depth=n, criterion="entropy",
                                 max_features=None, max_leaf_nodes=None,
                                 min_samples_leaf=1,
                                 min_samples_split=2, min_weight_fraction_leaf=0.0,
                                 random_state=None, splitter="best")
    clf = clf.fit(x_train, y_train)
    Yhat_tree = clf.predict(x_train)
    # shift labels {1, 2} to {0, 1} for roc_curve
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train-1, Yhat_tree-1)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    train_results.append(roc_auc)
    Yhat_tree_test = clf.predict(x_test)
    Yhat_tree_list.append(Yhat_tree_test)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test-1, Yhat_tree_test-1)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    test_results.append(roc_auc)
    report_tree.append(classification_report(y_test, Yhat_tree_test))
    f1_weight_avg_tree.append(f1_score(y_test, Yhat_tree_test, average='weighted'))
test_results_best_tree = np.argmax(test_results)+1 #iteration number (= list index + 1)
depth_tree_max = max_depths[test_results_best_tree-1] # tree depth at the maximum test AUC
report_best_tree = report_tree[test_results_best_tree-1]
Yhat_tree_test_best = Yhat_tree_list[test_results_best_tree-1]
print(depth_tree_max)
print(report_best_tree)
The high precision and recall values for class 1 suggest that the model is good at predicting SEVERITYCODE=1 accidents. The low values for SEVERITYCODE=2, however, suggest that most collisions resulting in injuries will not be predicted as such.
fig = plt.figure(figsize=(20, 8)) # create figure
fig.suptitle("Decision Tree Model QC", fontsize='20')
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
#Qualitative assessment of the quality of the model by comparing the value distribution
# of the real data with the predicted values.
ax1 = sns.distplot(Y, hist=True, color="r", label="Actual Value", ax=ax1)
ax1 = sns.distplot(Yhat_tree_test_best, color='b', hist=True, label="Fitted Values", ax=ax1)
ax1.set_ylabel("Observed data")
ax1.set_xlabel("SEVERITYCODE")
ax1.set_title('Distribution plot | Decision Tree: Best depth='+str(depth_tree_max))
ax1.legend()
ax2 = plt.plot(max_depths, train_results, 'k', label="Train AUC")
ax2 = plt.plot(max_depths, test_results, 'b', label="Test AUC")
ax2 = plt.plot(depth_tree_max,test_results[test_results_best_tree -1], 'X', color='k')
ax2 = plt.text(depth_tree_max-1,test_results[test_results_best_tree -1]+0.0005, ('maximum AUC score'), color='k')
plt.legend()
plt.ylabel('AUC score')
plt.xlabel('Tree depth')
plt.title('Area Under Curve (AUC) Score vs Decision Tree Depth | Best Depth for n='+str(depth_tree_max))
plt.show()
Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, the logistic regression is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
More information can be found here.
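At its core, logistic regression maps a linear combination of the features through the sigmoid function to a probability in (0, 1); a class label follows by thresholding at 0.5. A minimal sketch with hypothetical, hand-picked coefficients (not fitted to the Seattle data):

```python
from math import exp

def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict_proba(x, weights, bias):
    """P(class = 2 | x) for one feature vector under a linear model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

# Hypothetical coefficients, e.g. for (pedestrian involved, cyclist involved, speeding)
weights, bias = [1.8, 1.5, 0.6], -1.2

p = predict_proba([1, 0, 1], weights, bias)     # pedestrian involved, speeding
print(p)                                        # > 0.5 -> predict SEVERITYCODE = 2
print(predict_proba([0, 0, 0], weights, bias))  # < 0.5 -> predict SEVERITYCODE = 1
```

sklearn's LogisticRegression learns the weights and bias from the training data; the C parameter varied below controls how strongly those weights are regularized (smaller C = stronger regularization).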
C = np.linspace(1,10,10, endpoint=True) #Inverse of regularization strength
report_logreg = []
f1_weight_avg_logreg = []
train_results = []
test_results = []
Yhat_logreg_list = []
for n in C:
    n = int(n)
    clf = LogisticRegression(C=n, solver='liblinear', penalty='l2')
    clf = clf.fit(x_train, y_train)
    Yhat_logreg = clf.predict(x_train)
    # shift labels {1, 2} to {0, 1} for roc_curve
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train-1, Yhat_logreg-1)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    train_results.append(roc_auc)
    Yhat_logreg_test = clf.predict(x_test)
    Yhat_logreg_list.append(Yhat_logreg_test)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test-1, Yhat_logreg_test-1)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    test_results.append(roc_auc)
    report_logreg.append(classification_report(y_test, Yhat_logreg_test))
    f1_weight_avg_logreg.append(f1_score(y_test, Yhat_logreg_test, average='weighted'))
test_results_best_logreg = np.argmax(test_results)+1 #iteration number (= list index + 1)
depth_logreg_max = C[test_results_best_logreg-1] # C value at the maximum test AUC
report_best_logreg = report_logreg[test_results_best_logreg-1]
Yhat_logreg_test_best = Yhat_logreg_list[test_results_best_logreg-1]
print(depth_logreg_max)
print(report_best_logreg)
Similar to the other models, the high precision and recall values for class 1 suggest that the model is good at predicting SEVERITYCODE=1 accidents. The low values for SEVERITYCODE=2, however, suggest that most collisions resulting in injuries will not be predicted as such.
fig = plt.figure(figsize=(20, 8)) # create figure
fig.suptitle("Logistic Regression Model QC", fontsize='20')
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
#Qualitative assessment of the quality of the model by comparing the value distribution
# of the real data with the predicted values.
ax1 = sns.distplot(Y, hist=True, color="r", label="Actual Value", ax=ax1)
ax1 = sns.distplot(Yhat_logreg_test_best, color='b', hist=True, label="Fitted Values", ax=ax1)
ax1.set_ylabel("Observed data")
ax1.set_xlabel("SEVERITYCODE")
ax1.set_title('Distribution plot | Logistic Regression: Best inverse of regularization strength C='+str(depth_logreg_max))
ax1.legend()
ax2 = plt.plot(C, train_results, 'k', label="Train AUC")
ax2 = plt.plot(C, test_results, 'b', label="Test AUC")
ax2 = plt.plot(depth_logreg_max,test_results[test_results_best_logreg -1], 'X', color='k')
ax2 = plt.text(depth_logreg_max-1,test_results[test_results_best_logreg-1]+0.0005, ('maximum AUC score'), color='k')
plt.legend()
plt.ylabel('AUC score')
plt.xlabel('C')
plt.title('Area Under Curve (AUC) Score vs C | Best inverse of regularization strength C='+str(depth_logreg_max))
plt.show()
This modelling exercise tried to predict traffic accident severity (distinguishing between property damage only and physical injury) using supervised machine learning techniques on a dataset from Seattle, USA. Different modelling techniques were tested on the same cleaned-up dataset, containing 46 encoded features and 192747 entries before and 238461 entries after applying the Synthetic Minority Oversampling Technique (SMOTE) to the training dataset. Both pre-conditioning steps were required because the dataset was imbalanced and mostly contained categorical data. Due to the limited amount of data, 60% of the input data was used for training and 40% for testing. The dataset reaches back to March 2013.
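SMOTE balances the classes by interpolating new minority-class samples between existing ones rather than simply duplicating them. A minimal sketch of that interpolation step (numpy only, hypothetical points; the notebook itself uses the imblearn implementation):

```python
import numpy as np

def smote_like_sample(x, neighbor, rng):
    """One synthetic point on the line segment between a minority sample x
    and one of its minority-class nearest neighbours:
    x + u * (neighbor - x), with u drawn uniformly from [0, 1)."""
    u = rng.random()
    return x + u * (neighbor - x)

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0])        # hypothetical encoded feature vector
neighbor = np.array([3.0, 4.0]) # hypothetical minority-class neighbour

synthetic = smote_like_sample(x, neighbor, rng)
# The synthetic sample always lies between the two originals, per coordinate
print(synthetic)
```

Because the new points are interpolations rather than copies, the augmented training set adds variety to the minority class instead of inviting the classifier to memorise repeated rows.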
Initial exploratory data analysis revealed that 89.85% of accidents involving pedestrians and 87.53% of collisions involving cyclists lead to injuries (SEVERITYCODE = 2). This high level of probability could not be increased by including additional features in the model training. The active participation of pedestrians and cyclists in day-to-day traffic poses the highest risk of injury. The risk of physical damage is further increased at junctions.
Three different algorithms were tested in order to predict traffic accident severity in collisions: K-Nearest Neighbours (KNN), Decision Tree, and Logistic Regression.
KNN models were tested on both the original and the SMOTE-augmented dataset for comparison. To assess the quality of the model, the weighted F1 score was used together with the distribution plot, which served to gauge the number of false positives.
precision recall f1-score support
1 0.74 0.80 0.77 54069
2 0.42 0.35 0.38 23030
accuracy 0.66 77099
macro avg 0.58 0.57 0.57 77099
weighted avg 0.65 0.66 0.65 77099
With precision and recall values of 0.74 and 0.80 for SEVERITYCLASS=1, and 0.42 and 0.35 for SEVERITYCLASS=2, respectively, at K=8, this model performs best amongst the models tested.
Using the SMOTE-augmented training dataset yields comparable results.
precision recall f1-score support
1 0.73 0.83 0.78 54069
2 0.42 0.29 0.34 23030
accuracy 0.67 77099
macro avg 0.58 0.56 0.56 77099
weighted avg 0.64 0.67 0.65 77099
While the distribution plot shows a relatively good distribution of predicted SEVERITYCLASS values, the confusion matrix reveals a large number of misclassifications in both directions. In practice, this implies that accidents resulting in an injury will be misclassified as property damage only, which can end fatally when resources are allocated incorrectly.
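The failure mode described above — injury accidents predicted as property damage only — corresponds to the false-negative cell of a 2x2 confusion matrix. A small hand-rolled sketch (hypothetical labels mirroring the imbalance of the Seattle data) showing how those counts arise:

```python
def confusion_2x2(y_true, y_pred, positive=2):
    """Return (tn, fp, fn, tp), treating `positive` as the class of interest."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1  # injury accident predicted as property damage only
        elif p == positive:
            fp += 1  # property damage predicted as injury
        else:
            tn += 1
    return tn, fp, fn, tp

# Hypothetical labels: class 2 (injury) is the minority, as in the dataset
y_true = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 1, 1, 1, 2, 1, 1, 2]

tn, fp, fn, tp = confusion_2x2(y_true, y_pred)
print(fn)  # 2 injury accidents missed
```

The low recall for class 2 reported above is exactly tp / (tp + fn), so every additional false negative in this cell pulls recall — and with it the class-2 F1 score — down.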
In addition to the KNN classifier, decision trees were tested to reliably predict SEVERITYCLASS for accidents in Seattle. This evaluation was conducted on the non-augmented training data.
As the classification report below suggests, the model successfully predicts SEVERITYCLASS=1 incidents, however, the low recall value for SEVERITYCLASS=2 indicates a large number of false negatives and therefore a low F1 score for that class.
precision recall f1-score support
1 0.70 0.98 0.82 54069
2 0.45 0.03 0.06 23030
accuracy 0.70 77099
macro avg 0.58 0.51 0.44 77099
weighted avg 0.63 0.70 0.59 77099
In addition to the weighted F1 score for model evaluation, the Area Under Curve (AUC) score was tested, which yields the best results for a tree depth of 17.
The distribution plot indicates that the model overpredicts SEVERITYCLASS=1 but fails to reliably predict SEVERITYCLASS=2.
Last but not least, logistic regressors were tested to predict SEVERITYCLASS for accidents in Seattle from the labeled input dataset.
As the classification report below suggests, the model performs slightly better than the decision tree and manages to accurately predict SEVERITYCLASS=1 incidents; however, the low recall value for SEVERITYCLASS=2 indicates a large number of false negatives and therefore a relatively low F1 score for that class.
precision recall f1-score support
1 0.71 0.96 0.82 54069
2 0.46 0.07 0.12 23030
accuracy 0.70 77099
macro avg 0.58 0.52 0.47 77099
weighted avg 0.63 0.70 0.61 77099
In addition to the weighted F1 score for model evaluation, the Area Under Curve (AUC) score was tested, which yields the best results for an inverse of regularization strength C of 3.
The model suffers from the same issues as the decision tree and overpredicts SEVERITYCLASS=1, but fails to reliably predict SEVERITYCLASS=2.
This Capstone project served to provide insight into the lifecycle of a typical data science project and offered the opportunity to apply the previously learned theory to a real-life example, by means of exploratory data analysis and by testing and benchmarking various machine learning models against each other, with the ultimate goal of predicting accident severity from metadata acquired by the Seattle Police Department (SPD).
In total, three different methods were tested against each other to classify the SEVERITYCLASS, namely K-Nearest Neighbours (KNN), Decision Trees, and Logistic Regression.
While the exploratory data analysis suggests that almost 90% of all accidents involving pedestrians and 88% of collisions involving cyclists lead to injuries (compared with 28% of accidents without pedestrians/cyclists), the tested machine learning models generally had problems correctly predicting SEVERITYCLASS=2 and therefore exhibited a high number of false negatives for this class, which in real life can lead to a wrong allocation of resources by police and first responders and potentially end fatally.
The best method tested was the KNN.
A comparison of accidents resulting in property damage only (SEVERITYCODE=1) and accidents resulting in injuries (SEVERITYCODE=2) shows that, at 30%, the latter is underrepresented in the unbalanced input dataset. This was compensated for by augmenting additional data using SMOTE.